ISOMORPHISM AND SYMMETRIES IN RANDOM 
PHYLOGENETIC TREES 



MIKLOS BONA AND PHILIPPE FLAJOLET 



Abstract. The probability that two randomly selected phylogenetic trees of 
the same size are isomorphic is found to be asymptotic to a decreasing exponen- 
tial modulated by a polynomial factor. The number of symmetrical nodes in a 
random phylogenetic tree of large size obeys a limiting Gaussian distribution, 
in the sense of both central and local limits. The probability that two random 
phylogenetic trees have the same number of symmetries asymptotically obeys 
an inverse square-root law. Precise estimates for these problems are obtained 
by methods of analytic combinatorics, involving bivariate generating functions, 
singularity analysis, and quasi-powers approximations. 



1. Introduction 

Every high school student of every civilized part of the world is cognizant of the
tree of species, also known as the "tree of life", in relation to Darwin's theory of
evolution (Figure 1). We observe $n$ different species, and form a group with the
closest pair (under some suitable proximity criterion), then repeat the process with
the $n - 2$ remaining species together with the newly formed group, and so on. In
this way a phylogenetic tree, also known as a "cladogram", is obtained: such a tree
has the $n$ species at its external nodes, also called "leaves"; it has $n - 1$ internal
binary nodes, and it is naturally rooted at the last node obtained by the process.
Note that, by design, there is no specified order between the two children of a binary
node.

Seen from combinatorics, the phylogenetic trees under consideration are thus
trees in the usual sense of graph theory (i.e., acyclic connected graphs [4, §1.5]); in
addition, a binary node is distinguished as the root, and each node has outdegree
either 0 (leaf) or 2 (internal binary node). Finally, the leaves are labeled by distinct
integers, which we may canonically take to be the integer interval $[1\,..\,n]$. In classical
combinatorial terms, the set of phylogenetic trees thus corresponds to the set $\mathcal{B}$ of
rooted non-plane binary trees, which are labeled at their leaves.

We let $\mathcal{B}_n$ be the subset of $\mathcal{B}$ corresponding to trees of size $n$ (those with $n$ leaves)
and denote by $b_n := |\mathcal{B}_n|$ the corresponding cardinality. Considering the listing of
all unlabeled trees of sizes 1, 2, 3, 4,

(1)  [Display: the unlabeled trees of sizes 1, 2, 3, and 4; the two trees of size 4 are marked $L$ and $R$.]

the reader is invited to verify that $b_1 = 1$, $b_2 = 1$, $b_3 = 3$, and that $b_4 = 15$
is obtained by counting all possible labelings (3 and 12, respectively) of the two
trees $L$, $R$ shown on the right of (1).

A general formula for the numbers $b_n$ is well known and straightforward to prove.
Indeed, if we introduce the exponential generating function

$$B(z) := \sum_{n \ge 1} b_n \frac{z^n}{n!},$$
Figure 1. Left: the representation of a phylogenetic tree in Darwin's
own handwriting. Right: an illustration of the Tree of Life by Haeckel
in The Evolution of Man, published in 1879. (Source: Entry "Tree of
life", Wikipedia.)

then the fact that each element of $\mathcal{B}_n$ with $n \ge 2$ is built up from its two subtrees implies that

$$B(z) = z + \frac{1}{2} B(z)^2. \tag{2}$$

See the books by Stanley [20, pp. 13-15] or Flajolet-Sedgewick [9, §2.5] for details
and related results. So, $B(z)$ is the solution of the quadratic equation (2) that is a
generating function (that is, the solution with $B(0) = 0$):

$$B(z) = 1 - \sqrt{1 - 2z}.$$

This leads to the following exact formula for the numbers $b_n$.

Proposition 1. The number of phylogenetic trees with $n$ labeled leaves is

$$b_n = 1 \cdot 3 \cdots (2n-3) = (2n-3)!!.$$

There is a natural way to associate an unlabeled rooted binary non-plane tree to
each element $t \in \mathcal{B}_n$, by simply removing all the labels of $t$. We will say that two
elements $t, t' \in \mathcal{B}_n$ are isomorphic if removing their labels associates them with
the same unlabeled tree. This leads to the following intriguing question.

Question. What is the probability $p_n$ that two phylogenetic trees,
selected uniformly at random in $\mathcal{B}_n$, are isomorphic?

Note that, in our running example, the case of $n = 4$, we have $p_4 = 17/25$.
Indeed, if we select two elements of $\mathcal{B}_4$ at random, there is a $(3/15)^2 = (1/5)^2$
chance that they both belong to the isomorphism class of $L$, and a $(12/15)^2 = (4/5)^2$
chance that they both belong to the isomorphism class of $R$, where $L$ and $R$ are the
two trees of (1).
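As a quick sanity check of Proposition 1, the following Python sketch (not part of the original argument; the function names are illustrative) extracts the numbers $b_n$ from the recurrence implied by equation (2), namely $b_n = \frac{1}{2} \sum_{k=1}^{n-1} \binom{n}{k} b_k b_{n-k}$ for $n \ge 2$, and compares them with the double factorial $(2n-3)!!$.

```python
from math import comb


def count_phylogenetic(n_max):
    """Return a list b with b[n] = number of phylogenetic trees on n labeled leaves.

    The recurrence is read off from B(z) = z + B(z)^2 / 2 for exponential
    generating functions.
    """
    b = [0] * (n_max + 1)
    b[1] = 1
    for n in range(2, n_max + 1):
        # Choose which k labels go into one root subtree; the factor 1/2
        # forgets the order of the two (label-disjoint) subtrees.
        ordered = sum(comb(n, k) * b[k] * b[n - k] for k in range(1, n))
        b[n] = ordered // 2
    return b


def odd_double_factorial(n):
    """Return (2n-3)!! = 1 * 3 * ... * (2n-3); the empty product is 1."""
    result = 1
    for m in range(1, 2 * n - 2, 2):
        result *= m
    return result


b = count_phylogenetic(10)
print(b[1:6])  # [1, 1, 3, 15, 105]
print(all(b[n] == odd_double_factorial(n) for n in range(1, 11)))
```

The recurrence and the closed form agree on every size tested, which is what Proposition 1 predicts.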
In this paper, we will use a multivariate generating function argument (Section 2)
in conjunction with an analysis of singularities in the complex plane (Section 3) to
answer the isomorphism question in Theorem 1. In Section 4, we will extend our
analysis to distributional estimates of the number of symmetrical nodes in phylogenetic
trees and in their unlabeled counterparts, known as Otter trees: see Theorems 2
and 3 for central and local limit laws, respectively. Such results in particular
quantify the distribution of the log-size of the automorphism group of the random
trees under consideration. In Section 5, we will work out an explicit estimate of the
probability that two random trees have the same number of symmetries.

2. Isomorphism: a Generating Function Argument 

2.1. Unlabeled Trees. Let $\mathcal{U}_n$ be the set of all unlabeled rooted binary non-plane
trees with $n$ leaves, and let $u_n = |\mathcal{U}_n|$ be the corresponding count, with ordinary
generating function

$$U(z) := \sum_{n \ge 1} u_n z^n.$$



Such trees are often called Otter trees, since Otter was the first to study their
enumeration [17]. We can build a generic element of $\mathcal{U}_n$ by taking a tree $t' \in \mathcal{U}_k$
and a tree $t'' \in \mathcal{U}_{n-k}$, and joining their roots to a new root. As the order of $t'$
and $t''$ is not significant, we get each tree $t \in \mathcal{U}_n$ twice this way, except that, if the
two subtrees of $t$ are identical, we get $t$ only once. This leads to the functional
equation (see, e.g., [17]):

$$U(z) = z + \frac{1}{2}\left(U(z)^2 + U(z^2)\right). \tag{3}$$

The numbers $u_n$ are listed as sequence A001190 (the "Wedderburn-Etherington
numbers") in the On-Line Encyclopedia of Integer Sequences by Neil Sloane [19]
and are the answers to various combinatorial enumeration problems. The first few
values of the sequence, for $n = 1, \ldots, 10$, are 1, 1, 1, 2, 3, 6, 11, 23, 46, 98.
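The functional equation (3) translates directly into a recurrence for the $u_n$. The short Python sketch below is an illustration only (the function name is ours): it extracts the coefficient of $z^n$ on both sides of (3), where $U(z^2)$ contributes $u_{n/2}$ when $n$ is even and nothing otherwise.

```python
def otter_counts(n_max):
    """Return a list u with u[n] = number of Otter trees with n leaves."""
    u = [0] * (n_max + 1)
    u[1] = 1
    for n in range(2, n_max + 1):
        square = sum(u[k] * u[n - k] for k in range(1, n))  # [z^n] U(z)^2
        diagonal = u[n // 2] if n % 2 == 0 else 0           # [z^n] U(z^2)
        u[n] = (square + diagonal) // 2
    return u


print(otter_counts(10)[1:])  # [1, 1, 1, 2, 3, 6, 11, 23, 46, 98]
```

The output reproduces the ten values quoted above.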

2.2. A multivariate generating function. Let $t_1 \in \mathcal{B}_n$, and let $t_2 \in \mathcal{B}_n$. By
Proposition 1, there are $((2n-3)!!)^2$ possibilities for the ordered pair $(t_1, t_2)$, where $t_1$
and $t_2$ do not have to be distinct. Our goal is to count such ordered pairs in which
$t_1$ and $t_2$ are isomorphic. This number, divided by $((2n-3)!!)^2$, will then provide the
probability $p_n$ that two randomly selected elements of $\mathcal{B}_n$ are isomorphic.

Let $t \in \mathcal{U}_n$. Then the number of different labelings of the leaves of $t$ is

$$\frac{n!}{2^{\mathrm{sym}(t)}}, \tag{4}$$

where $\mathrm{sym}(t)$ is the number of non-leaf nodes $v$ of $t$ such that the two subtrees
stemming from $v$ are identical. For example, if $n = 4$, and $t$ is the tree $L$ of (1),
then we have $\mathrm{sym}(t) = 3$, and indeed, $t$ has $n!/2^3 = 24/8 = 3$ labelings. If $t$ is the
tree $R$ of (1), then we have $\mathrm{sym}(t) = 1$, and $t$ has $24/2 = 12$ labelings.
Isomorphism classes within $\mathcal{B}_n$ correspond to elements of $\mathcal{U}_n$. Set

$$w_n = \sum_{t \in \mathcal{U}_n} \frac{1}{2^{\mathrm{sym}(t)}}. \tag{5}$$

As we have mentioned above, $n!/2^{\mathrm{sym}(t)}$ is the number of labeled trees in the
isomorphism class corresponding to $t$. Summing this number over all isomorphism
classes, we obtain the total number of trees in $\mathcal{B}_n$. That is,

$$n!\, w_n = 1 \cdot 3 \cdots (2n-3) = (2n-3)!!.$$

For instance, $w_4 = \frac{1}{8} + \frac{1}{2} = \frac{5}{8}$, and $4! \cdot \frac{5}{8} = 15 = 5!!$.
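The identity $n!\, w_n = (2n-3)!!$ can also be confirmed by brute force for small sizes. The following Python sketch is purely illustrative and makes one representation choice that is ours, not the paper's: a leaf is the empty tuple and an internal node is a sorted pair of subtrees, so that two non-plane trees are equal exactly when their tuples are equal.

```python
from functools import lru_cache
from math import factorial

LEAF = ()  # assumption of this sketch: a leaf is the empty tuple


@lru_cache(maxsize=None)
def otter_trees(n):
    """Return the set of Otter trees with n leaves, each in canonical form."""
    if n == 1:
        return frozenset({LEAF})
    trees = set()
    for k in range(1, n // 2 + 1):
        for left in otter_trees(k):
            for right in otter_trees(n - k):
                # Sorting the pair forgets the left/right order (non-plane trees).
                trees.add(tuple(sorted((left, right))))
    return frozenset(trees)


def sym(tree):
    """Number of internal nodes whose two subtrees are identical."""
    if tree == LEAF:
        return 0
    left, right = tree
    return sym(left) + sym(right) + (1 if left == right else 0)


def labeled_count(n):
    """Sum of n!/2^sym(t) over all Otter trees t with n leaves, i.e. n! * w_n."""
    return sum(factorial(n) // 2 ** sym(t) for t in otter_trees(n))


for n in range(1, 9):
    print(n, len(otter_trees(n)), labeled_count(n))
```

For $n = 4$ the two trees found have $\mathrm{sym}$ equal to 1 and 3, and the labeled count is 15, in agreement with the worked example.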

Let

$$F(z,u) = \sum_{t \in \mathcal{U}} u^{\mathrm{sym}(t)} z^{|t|} \tag{6}$$

be the bivariate generating function of Otter trees, with $z$ marking the number of
leaves, and $u$ marking non-leaf nodes with two identical subtrees. In particular,
$F(z,u) = z + uz^2 + uz^3 + (u + u^3)z^4 + {}$ higher degree terms. The crucial observation
about $F(z,u)$ is the following.

Lemma 1. The bivariate generating function $F(z,u)$ that enumerates Otter trees
with respect to the number of symmetrical nodes satisfies the functional equation

$$F(z,u) = z + \frac{1}{2} F(z,u)^2 + \left(u - \frac{1}{2}\right) F(z^2, u^2). \tag{7}$$

Proof. If a tree consists of more than one node, then it is built up from its two
subtrees. As the order of the two subtrees is not significant, we will get each tree
twice this way, except the trees whose two subtrees are identical. If $t_1$ and $t_2$ are
the two subtrees of $t$ whose roots are the two children of the root of $t$, then

$$\mathrm{sym}(t) = \begin{cases} \mathrm{sym}(t_1) + \mathrm{sym}(t_2), & \text{if $t_1$ and $t_2$ are not identical,} \\ \mathrm{sym}(t_1) + \mathrm{sym}(t_2) + 1, & \text{if $t_1$ and $t_2$ are identical.} \end{cases}$$

The first term of the right-hand side of (7) represents the tree on one node, the
second term represents all other trees as explained in the preceding paragraph, and
the third term is the correction term for trees in which the two subtrees of the root
are identical: such a tree, with both subtrees equal to $t_1$, should contribute
$u \cdot u^{2\,\mathrm{sym}(t_1)} z^{2|t_1|}$, whereas the second term only accounts for it with weight
$\frac{1}{2} u^{2\,\mathrm{sym}(t_1)} z^{2|t_1|}$. □
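Equation (7) can be turned into a coefficient recurrence for the polynomials $[z^n] F(z,u)$. The Python sketch below is an illustration under our own conventions (polynomials in $u$ are stored as `Counter` objects mapping an exponent to its integer coefficient; the function names are hypothetical). It computes twice the coefficient of $z^n$ in $\frac{1}{2}F^2 + (u - \frac{1}{2})F(z^2,u^2)$ and halves the result.

```python
from collections import Counter


def sym_polynomials(n_max):
    """phi[n][k] = number of Otter trees with n leaves and sym(t) = k."""
    phi = [Counter() for _ in range(n_max + 1)]
    phi[1][0] = 1
    for n in range(2, n_max + 1):
        twice = Counter()
        # [z^n] F(z,u)^2: ordered pairs of root subtrees.
        for k in range(1, n):
            for i, a in phi[k].items():
                for j, b in phi[n - k].items():
                    twice[i + j] += a * b
        if n % 2 == 0:
            # (2u - 1) [z^n] F(z^2, u^2): correction for identical subtrees.
            for i, a in phi[n // 2].items():
                twice[2 * i + 1] += 2 * a
                twice[2 * i] -= a
        phi[n] = Counter({k: c // 2 for k, c in twice.items() if c})
    return phi


def evaluate(poly, u):
    """Evaluate a polynomial stored as {exponent: coefficient} at the point u."""
    return sum(c * u ** k for k, c in poly.items())


phi = sym_polynomials(10)
print(dict(phi[4]))                                # {1: 1, 3: 1}, i.e. u + u^3
print([evaluate(phi[n], 1) for n in range(1, 11)])  # Otter tree counts
print([evaluate(phi[n], 2) for n in range(1, 9)])   # total automorphism counts
```

Evaluating at $u = 1$ and $u = 2$ gives the counts of Otter trees and the total numbers of automorphisms, respectively.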

Note that various specializations of $F(z,u)$ have a known combinatorial meaning.
Indeed,

(i) If $u = 1$, then $F(z,1) = \sum_{t \in \mathcal{U}} z^{|t|}$ is simply the ordinary generating
function $U(z)$ of Otter trees with respect to their number of leaves. We have
discussed this generating function in Subsection 2.1 and mentioned that
its coefficients $u_n$ are the Wedderburn-Etherington numbers, which form
sequence A001190 in [19].
(ii) If $u = 2$, then $F(z,2) = \sum_{t \in \mathcal{U}} z^{|t|} 2^{\mathrm{sym}(t)}$ is the ordinary generating function
of the total number of automorphisms in all Otter trees. The coefficients
constitute sequence A003609 in [19]. Interested readers may consult McKeon's
studies [15] for details. The first few elements of the sequence are
1, 2, 2, 10, 14, 42, 90, 354.
(iii) If $u = 1/2$, then

$$F\left(z, \tfrac{1}{2}\right) = \sum_{t \in \mathcal{U}} z^{|t|} 2^{-\mathrm{sym}(t)} = \sum_{n} w_n z^n = \sum_{n} (2n-3)!! \frac{z^n}{n!}$$

is the exponential generating function $B(z)$ of labeled trees in disguise. We
have discussed this generating function in the Introduction. The numbers
$(2n-3)!!$ form sequence A001147 in [19].

It is more surprising that the substitution $u = 1/4$ will give us the answer we
are seeking. Let $[z^n] g(z)$ denote the coefficient of $z^n$ in the power series $g(z)$.
Lemma 2. For all positive integers $n \ge 2$, the probability $p_n$ that two phylogenetic
trees of size $n$ are isomorphic satisfies

$$p_n = \left(\frac{n!}{(2n-3)!!}\right)^2 [z^n] F\left(z, \tfrac{1}{4}\right).$$

Proof. Consider the sample space whose elements are the elements of $\mathcal{U}_n$, and in
which the probability of $t \in \mathcal{U}_n$ is

$$\kappa(t) = \frac{n!}{2^{\mathrm{sym}(t)}} \cdot \frac{1}{(2n-3)!!} = \frac{n!\, 2^{-\mathrm{sym}(t)}}{(2n-3)!!}. \tag{8}$$

(For probabilists, $\kappa$ is the image on $\mathcal{U}_n$ of the uniform distribution of $\mathcal{B}_n$.) For
instance, if $n = 4$, then this space has two elements (the two trees $L$, $R$ of (1)); one
has probability 1/5, and the other has probability 4/5. If we select two elements of
this space at random, the probability that they coincide is

$$\sum_{t \in \mathcal{U}_n} \kappa(t)^2 = \frac{n!^2}{(2n-3)!!^2} \sum_{t \in \mathcal{U}_n} \left(\frac{1}{4}\right)^{\mathrm{sym}(t)}.$$

Our claim now follows since $\sum_{t \in \mathcal{U}_n} \left(\frac{1}{4}\right)^{\mathrm{sym}(t)}$ is indeed the coefficient of $z^n$ in
$F(z, 1/4)$, in accordance with the definition (6). □
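Lemma 2 makes $p_n$ computable exactly. The sketch below is an illustration, not the authors' program: it evaluates the coefficients of $F(z,u)$ at a fixed rational $u$ by applying equation (7) recursively (the coefficients of $F(z,u^2)$ are needed only up to half the degree), using Python's `fractions.Fraction` for exact arithmetic. The function names are ours.

```python
from fractions import Fraction
from math import factorial


def coefficients(u, n_max):
    """Return [z^n] F(z, u) for n = 0..n_max at a fixed rational value of u."""
    f = [Fraction(0)] * (n_max + 1)
    if n_max >= 1:
        f[1] = Fraction(1)
    # Coefficients of F(z, u^2), needed only up to degree n_max // 2.
    g = coefficients(u * u, n_max // 2) if n_max >= 2 else []
    for n in range(2, n_max + 1):
        half_square = sum(f[k] * f[n - k] for k in range(1, n)) / 2
        correction = (u - Fraction(1, 2)) * g[n // 2] if n % 2 == 0 else 0
        f[n] = half_square + correction
    return f


def isomorphism_probability(n):
    """Exact p_n = (n! / (2n-3)!!)^2 * [z^n] F(z, 1/4), as in Lemma 2."""
    f_n = coefficients(Fraction(1, 4), n)[n]
    double_factorial = 1
    for m in range(1, 2 * n - 2, 2):
        double_factorial *= m
    return Fraction(factorial(n), double_factorial) ** 2 * f_n


for n in range(1, 8):
    print(n, isomorphism_probability(n))
```

For $n = 4$ this returns $17/25$, the value obtained by hand in the Introduction.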

3. Isomorphism: Singularity Analysis 

By Lemma 2, our goal is now to find the coefficient of $z^n$ in the one-variable
generating function

$$f(z) := F(z, 1/4).$$

Lemma 1 shows that the formal power series $F(z,u)$ is the solution of the quadratic
equation (7) that satisfies $F(0,0) = 0$. That is,

$$F(z,u) = 1 - \sqrt{1 - 2z - (2u-1) F(z^2, u^2)}. \tag{9}$$

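Iterated applications of (9), starting with $u = 1/4$, show that

$$\begin{aligned}
f(z) = F\left(z, \tfrac{1}{4}\right) &= 1 - \sqrt{1 - 2z + \tfrac{1}{2} F\left(z^2, \tfrac{1}{16}\right)} \\
&= 1 - \sqrt{\tfrac{3}{2} - 2z - \tfrac{1}{2}\sqrt{1 - 2z^2 + \tfrac{7}{8} F\left(z^4, \tfrac{1}{256}\right)}}.
\end{aligned}$$

In the limit, there results that $f(z)$ admits a "continued square-root" expansion

$$f(z) = 1 - \sqrt{\frac{3}{2} - 2z - \frac{1}{2}\sqrt{\frac{15}{8} - 2z^2 - \frac{7}{8}\sqrt{\frac{255}{128} - 2z^4 - \frac{127}{128}\sqrt{\cdots}}}},$$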


out of which initial elements of the sequence $(p_n)_{n \ge 1}$ are easily determined:

$$1,\ 1,\ 1,\ \frac{17}{25},\ \frac{3}{7},\ \frac{5}{21},\ \frac{13}{99},\ \frac{12497}{184041},\ \frac{17893}{511225},\ \frac{102797}{5909761},\ \ldots$$
In order to compute the growth rate of the coefficients of $f(z)$, we will analyze the
dominant singularity (or singularities) of this power series. The interested reader is
invited to consult the book Analytic Combinatorics by Flajolet and Sedgewick [9]
for more information on the notions and techniques that we are going to use. Part
of the difficulty of the problem is that the functional relation (7) has the character
of an inclusion-exclusion formula: $F(z,u)$ does not depend positively on $F(z^2,u^2)$
as soon as $u < 1/2$, which requires suitably crafted arguments, in contrast to the
(simpler) asymptotic analysis of the Otter tree counts $u_n$ (the case $u = 1$).

Briefly, we are interested in the location, type, and number of the dominant
singularities of $f(z)$, that is, singularities that have smallest absolute value (modulus).

3.1. Location. First, it is essential for our analytic arguments to establish that
$f(z)$ has a radius of convergence strictly less than 1. Our starting point parallels
Lemmas 1-2 of McKeon [15], but we need a specific argument for the upper bound.

Lemma 3. Let $\rho$ be the largest real number such that $f(z)$ is analytic in the interior
of a disc centered at the origin that has radius $\rho$. The following inequalities hold:

$$0.4 < \rho < 0.625.$$

Proof. (i) Lower bound. Note that $f(z)$ is convergent in some disc of radius at least
0.4, since the coefficients of $f(z) = F(z, 1/4)$ are at most as large as the coefficients
of $F(z,1)$, the generating function $U(z)$ of Otter trees, and the latter is known to
be convergent in a disc of radius $0.40269\cdots$: see Otter's original paper [17] and
Finch's book [6, §5.6] for more details on the asymptotics of $F(z,1) = U(z)$.

(ii) Upper bound. For fixed $n$, let $a_1, a_2, \ldots, a_{u_n}$ be the numbers of our labeled
trees whose underlying unlabeled tree is the first, second, ..., last Otter tree of size
$n$. Then the relation

$$p_n = \frac{a_1^2 + a_2^2 + \cdots + a_{u_n}^2}{(a_1 + a_2 + \cdots + a_{u_n})^2} \ge \frac{1}{u_n} \tag{10}$$

results from the Cauchy-Schwarz inequality. (In words: the probability of coincidence
of two elements from a finite probability space is smallest when the distribution
is the uniform one.)

As we mentioned, it is proved in [17] that the generating function $\sum_n u_n z^n$
converges in a disc of radius at least 0.4. Therefore, the series $\sum_n \frac{1}{u_n} z^n$ converges
in a disc of radius at most $1/0.4 = 2.5$, and by (10), this implies that $\sum_n p_n z^n$
converges in a disc of radius less than 2.5. Now Lemma 2 shows that $F(z, 1/4)$
is convergent in a disc of radius less than $2.5/4 = 0.625$, since the coefficients of
$F(z, 1/4)$ are, up to polynomial factors, $4^n$ times larger than the coefficients of
$\sum_n p_n z^n$. It follows that $\rho < 0.625$. □

A well-known theorem of Pringsheim states that if a function $g(z)$ is representable
around the origin by a series expansion that has non-negative coefficients and radius
of convergence $R$, then the real number $R$ is actually a singularity of $g(z)$. Applying
this theorem to $f(z)$, we see that the positive real number $\rho$ must be a singularity
of $f(z)$.

3.2. Type. Recall that a function $g(z)$ analytic in a domain $\Omega$ is said to have
a square-root singularity at a boundary point $\alpha$ if, for some function $H$ analytic
at 0, the representation $g(z) = H(\sqrt{z - \alpha})$ holds in the intersection of $\Omega$ and a
neighborhood of $\alpha$. (In particular, if $g(z) = \sqrt{\gamma(z)}$ with $\gamma$ analytic at $\alpha$, then $g(z)$
has a square-root singularity at $\alpha$ whenever $\gamma(\alpha) = 0$ and $\gamma'(\alpha) \ne 0$.)

Lemma 4. All dominant singularities (of modulus $\rho$) of $f(z)$ are isolated and are
of the square-root type.

Proof. In order to see this, note that $\rho < 1$ (proved in Lemma 3) implies that
$\rho < \sqrt{\rho}$. Therefore, the power series $F(z^2, 1/4)$ (that has radius of convergence
$\sqrt{\rho}$) is analytic in the interior of the disc of radius $\sqrt{\rho}$, and so is the power series
$F(z^2, 1/16)$ since its coefficients are at most as large as the corresponding coefficients of
$F(z^2, 1/4)$. Consequently, Equation (9) implies that the dominant singularities of

$$f(z) = F\left(z, \tfrac{1}{4}\right) = 1 - \sqrt{1 - 2z + \tfrac{1}{2} F\left(z^2, \tfrac{1}{16}\right)} \tag{11}$$

are of the square-root type: they are to be found amongst the roots of the expression
under the square-root sign in (11), that is, amongst the zeros of $1 - 2z + \frac{1}{2}F(z^2, 1/16)$
that have modulus $\rho$. As $1 - 2z + \frac{1}{2}F(z^2, 1/16)$ is analytic in the disc centered at
the origin with radius at least $\sqrt{\rho} > \rho$, it has isolated roots. Hence $f(z)$ has only
a finite number of singularities on the circle $|z| = \rho$, and each is of square-root
type. □



The argument of the proof (see (11)) also shows that $\rho$ is determined as the
smallest positive root of the equation

$$1 - 2\rho + \frac{1}{2} F\left(\rho^2, \frac{1}{16}\right) = 0. \tag{12}$$
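Equation (12) lends itself to a direct numerical solution. The following Python sketch is one possible way to do it, under assumptions of ours: the series $F(z, 1/16)$ is truncated after 80 terms (its coefficients are obtained in floating point from equation (7)), and the root is located by bisection on the interval $[0.4, 0.625]$ supplied by Lemma 3. Function names and the truncation order are illustrative choices.

```python
def coefficients(u, n_max):
    """Floating-point coefficients [z^n] F(z, u), n = 0..n_max, from equation (7)."""
    f = [0.0] * (n_max + 1)
    if n_max >= 1:
        f[1] = 1.0
    g = coefficients(u * u, n_max // 2) if n_max >= 2 else []
    for n in range(2, n_max + 1):
        f[n] = 0.5 * sum(f[k] * f[n - k] for k in range(1, n))
        if n % 2 == 0:
            f[n] += (u - 0.5) * g[n // 2]
    return f


def evaluate(coeffs, z):
    """Horner evaluation of the truncated series sum_n coeffs[n] * z^n."""
    total = 0.0
    for c in reversed(coeffs):
        total = total * z + c
    return total


def dominant_singularity(n_terms=80, iterations=60):
    """Bisection for the root of 1 - 2r + F(r^2, 1/16)/2 in [0.4, 0.625]."""
    g = coefficients(1.0 / 16.0, n_terms)

    def lhs(r):
        return 1.0 - 2.0 * r + 0.5 * evaluate(g, r * r)

    low, high = 0.4, 0.625  # bracket provided by Lemma 3
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if lhs(mid) > 0.0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


rho = dominant_singularity()
print(rho)  # approximately 0.5899
```

The truncation is harmless here because $F(z, 1/16)$ is only evaluated at $z = r^2 \le 0.39$, well inside its disc of convergence.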

3.3. Number. In order to complete our characterization of the dominant singular
structure of $f(z)$, we need the following statement.

Lemma 5. The point $\rho$ is the only singularity of smallest modulus of $f(z)$.

Proof. The argument is somewhat indirect and it proceeds in two stages.

First we show that, as a power series, $f(z)$ converges for each $z$ with $|z| = \rho$.
To this purpose, we need to recall briefly some principles of singularity analysis, as
expounded in [9, Ch. VI]. Let $g(z)$ be a function analytic in $|z| < R$ with finitely
many singularities at the set $\{\alpha_j\}$ on the circle $|z| = R$; assume in addition that
$g(z)$ has a square-root singularity at each $\alpha_j$ in the sense of Subsection 3.2. Then,
one has $[z^n] g(z) = O(R^{-n} n^{-3/2})$. (This corresponds to the O-transfer theorem
of [9, Th. VI.3, p. 390], with amendments for the case of multiple singularities to
be found in [9, §VI.5]; see also (14) below.) It follows from this general estimate
and Lemma 4 that

$$[z^n] f(z) = O(\rho^{-n} n^{-3/2}).$$

Therefore, the series expansion of $f(z)$ converges absolutely as long as $|z| \le \rho$, and,
in particular, it converges for all $z$ with modulus $\rho$.

Now, we are in a position to prove that $f(z)$ has no singularity other than $\rho$ on
the circle $|z| = \rho$. Let us assume the contrary; that is, there is a complex number $z_0 \ne \rho$
such that $|z_0| = \rho$ and $z_0$ is a singularity of $f(z) = F(z, 1/4)$. Then, it follows
from (9) that $f(z_0) = F(z_0, 1/4) = 1$, since the expression under the square-root
sign in (9) is equal to 0, corresponding to a singularity of square-root type; the same
reasoning applied at $\rho$ gives $f(\rho) = 1$. On
the other hand, one has a priori $|f(z_0)| \le f(\rho)$, as a consequence of the triangle
inequality and the fact, proved above, that $f(z)$ converges on $|z| = \rho$. Now it follows
from the strong triangle inequality that the equality $f(z_0) = f(\rho)$ is only possible
if all the terms $f_n z_0^n$ that compose the (convergent) series expansion of $f(z_0)$ are
positive real. (Here $f_n = [z^n] f(z)$.) However, since, in particular, $f_1 = 1$ is
nonzero, this implies that $z_0 = \rho$, and a contradiction has been reached. (This part
of the argument is also closely related to the Daffodil Lemma of [9, p. 266].) □
3.4. The asymptotics of $p_n$. As a result of Lemmas 3-5, the function $f(z)$ has
only one dominant singularity, and that singularity $\rho$ is of the square-root type.
One then has, for a family of constants $h_k$, the local singular expansion

$$f(z) = 1 + \sum_{k=0}^{\infty} h_k (1 - z/\rho)^{k + 1/2}, \tag{13}$$

which is valid for $z$ near $\rho$. The conditions of the singularity analysis process as
summarized in [9, §VI.4] are then satisfied. Consequently, each singular element
of (13) relative to $f(z)$ can be translated into a matching asymptotic term relative
to $[z^n] f(z)$, according to the rule

$$\sigma(z) = (1 - z/\rho)^{\theta} \quad\longrightarrow\quad [z^n]\sigma(z) = \rho^{-n} \binom{n - \theta - 1}{n} \underset{n \to \infty}{\sim} \rho^{-n} \frac{n^{-\theta-1}}{\Gamma(-\theta)}. \tag{14}$$

In particular, we have $[z^n] f(z) \sim C \cdot \rho^{-n} n^{-3/2}$, for some constant $C$.

Hence Lemma 2, combined with Lemmas 4-5 and the routine asymptotics of
$n!/(2n-3)!!$ by Stirling's formula, leads to the following theorem.

Theorem 1. The probability that two phylogenetic trees of size $n$ are isomorphic
admits a complete asymptotic expansion

$$p_n \sim a \cdot b^{-n} \cdot n^{3/2} \left(1 + \sum_{k \ge 1} \frac{c_k}{n^k}\right), \tag{15}$$

where $a$, $b = 4\rho$, and the $c_k$ are computable constants, with values $a = 3.17508\cdots$,
$b = 2.35967\cdots$, and $c_1$ approximately equal to $-0.626$.

The function $F(z,u)$ can be determined numerically to great accuracy (by means
of the recursion corresponding to the functional equation (9)). So, the value

$$\rho \doteq 0.589918271485535\cdots$$

is obtained as the smallest positive root of (12); the constant $a$ then similarly
results from an evaluation of $F'(\rho^2, \frac{1}{16})$; the constant $c_1$, which could in principle
be computed in the same manner, was, in our experiments, simply estimated from
the values of $p_n$ for small $n$. The formula (15), truncated after its $c_1/n$ term, then
appears to approximate p„ with a relative accuracy better than 10^^ for n > 5, 
10"^ for n > 38, and 10"^ for n > 47. 

4. Symmetrical Nodes and Automorphisms 

In the course of our investigations on analytic properties of the bivariate generating
function $F(z,u)$, we came up with a few additional estimates, which improve
on those of McKeon [15]. In essence, what is at stake is a perturbative analysis
of $F(z,u)$ and its associated singular expansions, for various values of $u$, in a way
that refines the developments of the previous section. We offer here a succinct
account: details can be easily supplemented by referring to Chapter IX of the book
Analytic Combinatorics [9].

Theorem 2. (i) Let $X_n$ be the random variable representing the number of symmetrical
nodes in a random Otter tree of $\mathcal{U}_n$. Then, $X_n$ satisfies a limit law of
Gaussian type,

$$\forall x \in \mathbb{R}: \quad \lim_{n \to \infty} \mathbb{P}\left(X_n \le \mu n + \sigma x \sqrt{n}\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-w^2/2}\, dw,$$

for some positive constants $\mu$ and $\sigma$. Numerically, $\mu = 0.35869\cdots$.

(ii) Let $Y_n$ be the random variable representing the number of symmetrical nodes
in a random phylogenetic tree of $\mathcal{B}_n$. Then, $Y_n$ satisfies a limit law of Gaussian
type,

$$\forall x \in \mathbb{R}: \quad \lim_{n \to \infty} \mathbb{P}\left(Y_n \le \widehat{\mu} n + \widehat{\sigma} x \sqrt{n}\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-w^2/2}\, dw,$$

for some positive constants $\widehat{\mu}$ and $\widehat{\sigma}$. Numerically, $\widehat{\mu} = 0.27104\cdots$.

Figure 2. Histograms of the distribution of the number of symmetrical
nodes in trees of size 100, compared to a matching Gaussian. Left: Otter
trees of $\mathcal{U}_{100}$. Right: phylogenetic trees of $\mathcal{B}_{100}$.

Proof (Sketch). (i) The case of Otter trees $(X_n, \mathcal{U}_n)$. In accordance with general
principles [9, Ch. IX], we need to estimate the generating polynomial

$$\varphi_n(u) := [z^n] F(z,u), \tag{16}$$

when $u$ is close to 1, with $F(z,u)$ as specified by (6) and (7). For $u$ in a small enough
complex neighborhood $\Omega$ of 1, the radius of convergence of $F(z^2, u^2)$ is larger than
some $\rho_2 > \rho_1$, where $\rho_1 \approx 0.40269$ is the radius of convergence associated with
Otter trees. Then, by an argument similar to the ones used earlier, there exists a
solution $\rho(u)$ to the analytic equation

$$1 - 2\rho(u) - (2u - 1) F\left(\rho(u)^2, u^2\right) = 0 \tag{17}$$

(compare with (12)), such that $\rho(1) = \rho_1$ is the dominant singularity of the generating
function $F(z,1)$ of Otter trees. By the analytic version of the implicit function
theorem (equivalently, by the Weierstrass Preparation Theorem), this function $\rho(u)$
depends analytically on $u$, for $u$ near 1.

In addition, by (9), the function $F(z,u)$ has a singularity of the square-root
type at $\rho(u)$. Also, for $u \in \Omega$ and $\Omega$ taken small enough, the triangle inequality
combined with the previously established properties of $F(z,1)$ may be used to
verify that there are no other singularities of $z \mapsto F(z,u)$ on $|z| = |\rho(u)|$. There
results, from singularity analysis and the uniformity of the process [9, p. 668], the
asymptotic estimate

$$\varphi_n(u) = c(u)\, \rho(u)^{-n} n^{-3/2} (1 + o(1)), \qquad n \to +\infty, \tag{18}$$

uniformly with respect to $u \in \Omega$, for some $c(u)$ that is analytic at $u = 1$. Then, the
probability generating function of $X_n$, which equals $\varphi_n(u)/\varphi_n(1)$, satisfies what is
known as a "quasi-powers approximation". That is, it resembles (analytically) the
probability generating function of a sum of independent random variables,

$$\frac{\varphi_n(u)}{\varphi_n(1)} = \frac{c(u)}{c(1)} \left(\frac{\rho(1)}{\rho(u)}\right)^n (1 + \varepsilon_n(u)), \tag{19}$$

where $\sup_{u \in \Omega} |\varepsilon_n(u)|$ tends to 0 as $n \to \infty$. The Quasi-powers Theorem (see [9,
§IX.5] and [13]) precisely applies to such approximations by quasi-powers and
implies that the distribution of $X_n$ is asymptotically normal.

(ii) The case of phylogenetic trees $(Y_n, \mathcal{B}_n)$. The starting point is a simple
combinatorial property of $\varphi_n(u)$, as defined in (16):

$$\varphi_n(u) = \sum_{t \in \mathcal{U}_n} u^{\mathrm{sym}(t)} = \frac{1}{n!} \sum_{t \in \mathcal{B}_n} (2u)^{\mathrm{sym}(t)}. \tag{20}$$

(The first form results from the definition (6) of $F(z,u)$; the second form relies on
the expression (4) of the number of different labellings of an Otter tree that give rise
to a phylogenetic tree.) Thus, $\varphi_n$ taken with an argument near 1/2 serves, up to
normalization, as the probability generating function of the number of symmetrical
nodes in phylogenetic trees of $\mathcal{B}_n$.

From this point on, the analysis of symmetries in phylogenetic trees is entirely
similar to that of Otter trees. For $u$ in a small complex neighborhood of 1/2,
the generating function $z \mapsto F(z,u)$ has a dominant singularity $\rho(u)$ that is an
analytic solution of (17) and is such that $\rho(1/2) = 1/2$, the radius of convergence of
$B(z) = F(z, 1/2)$. As a consequence, estimates that parallel those of (18) and (19)
are seen to hold, but with $u \in \Omega$ now near 1/2. In particular,

$$\frac{\varphi_n(u)}{\varphi_n(1/2)} = \frac{c(u)}{c(1/2)} \left(\frac{\rho(1/2)}{\rho(u)}\right)^n (1 + \widehat{\varepsilon}_n(u)), \tag{21}$$

where $\widehat{\varepsilon}_n(u) \to 0$ uniformly. By the Quasi-powers Theorem (set $u := v/2$, with $v$
near 1), the distribution of $Y_n$ is asymptotically normal. □

Figure 2 shows that the fit with a Gaussian is quite good, even for comparatively
low sizes ($n = 100$). Phrased differently, the statement of Theorem 2 means that
the logarithm of the order $2^{\mathrm{sym}(t)}$ of the automorphism group of a random tree $t$
(either in $\mathcal{U}_n$ or in $\mathcal{B}_n$) is asymptotically normally distributed.¹ In the case of $\mathcal{U}_n$, the
expectation of the cardinality of this group has been determined by McKeon [15] to grow roughly
as $1.33609^n$. In the case of phylogenetic trees ($\mathcal{B}_n$), we find an expected growth of
the rough form $1.24162^n$, where the exponential rate $1.24162\cdots$ is exactly $1/(2\rho_1)$,
with $\rho_1$, still, the radius of convergence of $U(z) = F(z,1)$. (These values are
consistent with the fact that trees with a higher number of symmetries admit a
smaller number of labellings, hence are less likely to appear as "shapes", under the
phylogenetic model $\mathcal{B}_n$.)
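The mean values appearing in Theorem 2 can be observed on exact data. The Python sketch below is an illustration under our own conventions (list-based polynomials, hypothetical function names, and the arbitrary choice $n = 60$): it tabulates, via equation (7), the number of Otter trees of each size with a given number of symmetrical nodes, then computes the mean of $X_n$ (uniform Otter trees) and of $Y_n$ (each Otter tree weighted by its $n!/2^{\mathrm{sym}(t)}$ labelings).

```python
def sym_counts(n_max):
    """counts[n][k] = number of Otter trees with n leaves and k symmetrical nodes."""
    counts = [[], [1]]
    for n in range(2, n_max + 1):
        twice = [0] * n  # sym(t) is at most n - 1
        for k in range(1, n):
            for i, a in enumerate(counts[k]):
                for j, b in enumerate(counts[n - k]):
                    twice[i + j] += a * b
        if n % 2 == 0:
            # Correction (2u - 1) F(z^2, u^2) for trees with identical root subtrees.
            for i, a in enumerate(counts[n // 2]):
                twice[2 * i + 1] += 2 * a
                twice[2 * i] -= a
        counts.append([c // 2 for c in twice])
    return counts


def mean_sym_otter(row):
    """Mean of X_n under the uniform distribution on Otter trees."""
    return sum(k * c for k, c in enumerate(row)) / sum(row)


def mean_sym_phylogenetic(row):
    """Mean of Y_n: each Otter tree is weighted proportionally to n!/2^sym(t)."""
    n = len(row)
    weights = [c * 2 ** (n - 1 - k) for k, c in enumerate(row)]
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)


counts = sym_counts(60)
n = 60
print(mean_sym_otter(counts[n]) / n)         # to be compared with 0.35869...
print(mean_sym_phylogenetic(counts[n]) / n)  # to be compared with 0.27104...
```

At this moderate size the ratios are expected to be close to, but not equal to, the limiting constants, since the means carry lower-order corrections.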

As a matter of fact, the histograms of Figure 2 suggest that a convergence
stronger than a plain convergence in law (corresponding to convergence of the
distribution function) holds.

¹ The situation is loosely evocative of the fact (Erdős-Turán Theorem) that the logarithm of
the order of a random permutation of size $n$ is normally distributed; see, e.g., [5, 11, 16].
Definition 1. Let $(\xi_n)$ be a family of random variables with expectation $\mu_n = \mathbb{E}(\xi_n)$
and variance $\sigma_n^2 = \mathbb{V}(\xi_n)$. It is said to satisfy a local limit law with density $g(x)$ if
one has

$$\lim_{n \to \infty}\ \sup_{x \in \mathbb{R}} \left| \sigma_n \mathbb{P}\left(\xi_n = \lfloor \mu_n + x \sigma_n \rfloor\right) - g(x) \right| = 0. \tag{22}$$

In other terms, we expect the probability of $\xi_n$ being at $x$ standard deviations away
from its mean to be well approximated by $g(x)/\sigma_n$. This concept is discussed in
the case of sums of random variables by Gnedenko and Kolmogorov in [10, Ch. 9]
and, in a broader combinatorial context, by Bender [1] and Flajolet-Sedgewick [9,
§IX.9].

Theorem 3. The number of symmetrical nodes in either an unlabeled tree ($X_n$
on $\mathcal{U}_n$) or a phylogenetic tree ($Y_n$ on $\mathcal{B}_n$) satisfies a local limit law of the Gaussian
type. That is, in the sense of Definition 1, a local limit law holds, with density

$$g(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.$$

Proof. (i) The unlabeled case $(X_n, \mathcal{U}_n)$. The proof essentially boils down to
establishing that

$$\varphi_n(u) \equiv [z^n] F(z,u)$$

is small compared to $[z^n] F(z,1)$, as soon as $u$ satisfies $|u| = 1$ and stays away
from 1; then, Theorem IX.14, p. 696, from [9] does the rest. The arguments
are variations of the ones previously used.

Since a tree of size $n$ has less than $n$ symmetrical nodes, we have $|\varphi_n(u)| \le
|u|^n \varphi_n(1)$ for any $|u| \ge 1$. There results that the convergence of the series expansion
of $F(z,u)$ is dominated by that of $F(|zu|, 1)$, whenever $|u| \ge 1$. Apply the fact
explained in the previous sentence, with $z^2$ and $u^2$ instead of $z$ and $u$, to get that
the coefficients of $F(z^2, u^2)$ are less than the coefficients of $F(|z^2u^2|, 1)$, where the
latter series is convergent if $|z^2u^2| < 0.4$, or in other words, $|zu| < 0.63$, say. Now
choose $\eta$ so that $(1 + \eta)(\rho_1 + \eta) < 0.63$, where $\rho_1$ is the radius of convergence of
Otter trees ($\rho_1 = \rho(1) \approx 0.40269$). Then $F(z^2, u^2)$ is bivariate analytic whenever
$|z| < \rho_1 + \eta$ and $|u| < 1 + \eta$. In accordance with previously developed arguments,
this implies that, for any fixed $u$ satisfying $|u| < 1 + \eta$, the function $z \mapsto F(z,u)$
has only finitely many singularities, each of the square-root type, in $|z| < \rho_1 + \eta$.

For $u$ in a small complex neighborhood $\Omega$ of 1, we already know that $z \mapsto F(z,u)$
has only one dominant singularity at some $\rho(u)$, which is a root of

$$1 - 2\rho(u) - (2u - 1) F\left(\rho(u)^2, u^2\right) = 0.$$

(This property lies at the basis of the central limit law of the previous theorem.)

Consider now a $u$ such that $|u| = 1$, but $u \notin \Omega$. We argue that $z \mapsto F(z,u)$ is
analytic at all points $z$ such that $|z| = \rho_1$. Indeed for such values of $u$ and $z$, we
have, by the strong triangle inequality,

$$|F(z,u)| < F(\rho_1, 1), \tag{23}$$

the reason being that, in the expansion $F(z,u) = z + uz^2 + uz^3 + \cdots$, the values
of the monomials $u^k z^n$ cannot be all collinear, unless $u = 1$. The inequality (23)
combined with the fact that $F(\rho_1, 1) = 1$ implies that $z \mapsto F(z,u)$ cannot be
singular (since, as we know, the only possibility for a singularity would be that it
is of the square-root type and $F(z,u) = 1$).
Thus, for $|u| = 1$ and $u \notin \Omega$, the function $z \mapsto F(z,u)$ is analytic at all points
of $|z| = \rho_1$. Hence, it is analytic in $|z| < \rho_1 + \delta$, for some $\delta > 0$. By usual exponential
bounds, there results that, for some $K > 0$, one has

$$|\varphi_n(u)| < K (\rho_1 + \delta/2)^{-n}, \qquad |u| = 1, \quad u \notin \Omega. \tag{24}$$

As expressed by Theorem IX.14 of [9], the existence of a quasi-powers approximation
(when $u$ is near 1), as in (18) and (19), and of the exponentially small bound
(when $u \notin \Omega$ is away from 1), as provided by (24), suffices to ensure the existence
of a local limit law.²

(ii) The labeled case $(Y_n, \mathcal{B}_n)$. In accordance with (20), the function $F(z, u/2)$ is
the bivariate exponential generating function of phylogenetic trees, with $z$ marking
size and $u$ marking the number of symmetrical nodes. Consider once more $|u| = 1$
and distinguish the two cases $u \in \Omega$ (for which the proof of Theorem 2 provides a
quasi-powers approximation) and $u \notin \Omega$. In the latter case, arguments that entirely
parallel those applied to unlabeled trees give us that $z \mapsto F(z, u/2)$ has no
singularity on $|z| = 1/2$. This implies, for $u \notin \Omega$, the exponential smallness of $\varphi_n(u/2)$,
as defined in (20), resulting in an estimate that parallels (24). Theorem IX.14 of [9]
again enables us to conclude as to the existence of a local limit law. □



5. Coincidence of the Number of Symmetries 

From a statistician's point of view, it may be of interest to determine the probability
for two trees to be "similar" (rather than plainly isomorphic), given some
structural similarity distance between non-plane trees; see, for instance, the work
of Ycart and Van Cutsem [21] for a study conducted under probabilistic assumptions
that differ from ours. Combinatorial generating functions can still be useful
in this broad range of problems, as we now show by considering the following question:
determine the probability that two randomly chosen trees $\tau$, $\tau'$ of the same size
have the same number of symmetrical nodes. This probability a priori lies in the
interval $[\frac{1}{n}, 1]$; we shall see, in Theorem 4, that its asymptotic value is "in-between".

The problem under consideration belongs to an orbit of questions occasionally
touched upon in the literature. For instance, Wilf showed that the probability
that two permutations of size $n$ have the same number of cycles is asymptotic
to $(2\sqrt{\pi \log n})^{-1}$; Bóna and Knopfmacher [2] examine combinatorially and
asymptotically the probability that various types of integer compositions have the same
number of parts, and several other coincidence probabilities are studied in [7]. The
following basic lemma trivializes the asymptotic side of several such questions.

Lemma 6. Let $\mathcal{C}$ be a combinatorial class equipped with an integer-valued parameter
$\chi$. Assume that the random variable corresponding to $\chi$ restricted to $\mathcal{C}_n$
(under the uniform distribution over $\mathcal{C}_n$) satisfies a local limit law with density $g(x)$,
in the sense of Definition 1. Let the variance of $\chi$ on $\mathcal{C}_n$ be $\sigma_n^2$ and assume that
$g(x)$ is continuously differentiable. Then, the probability that two objects $c, c' \in \mathcal{C}_n$
admit the same value of $\chi$ satisfies the asymptotic estimate

$$\mathbb{P}\left(\chi(c) = \chi(c')\right) \sim \frac{K}{\sigma_n}, \qquad \text{where} \quad K := \int_{-\infty}^{\infty} g(x)^2\, dx. \tag{25}$$

Note that, for $g(x)$ the standard Gaussian density, one has $K = 1/(2\sqrt{\pi})$.

² The reasoning corresponding to Theorem IX.14 of [9] is simple: start from the Cauchy coefficient integral

$$\mathbb{P}(X_n = k) = \frac{1}{2i\pi} \oint_{|u| = 1} \frac{\varphi_n(u)}{\varphi_n(1)} \frac{du}{u^{k+1}}.$$

Use (24) to neglect the contribution corresponding to $u \notin \Omega$; appeal to the saddle point method
applied to the quasi-powers approximation to estimate the central part $u \in \Omega$, and conclude.

Proof (sketch). Let $\varpi_n$ be the probability of coincidence; that is, the left-hand side
of (25). Observe that, by hypothesis, we must have $\sigma_n \to \infty$. The baseline is that

$$\varpi_n = \sum_{k} \mathbb{P}(\chi = k)^2 \sim \frac{1}{\sigma_n^2} \sum_{x \in \mathcal{E}_n} g(x)^2 \sim \frac{1}{\sigma_n} \int_{-\infty}^{\infty} g(x)^2\, dx, \qquad \text{with} \quad \mathcal{E}_n := \frac{1}{\sigma_n}\left(\mathbb{Z}_{\ge 0} - \mu_n\right).$$

To justify this chain rigorously, first restrict attention to values of $x$ in a finite
interval $[-A, +B]$, so that the tails $\left(\int_{-\infty}^{-A} + \int_{B}^{\infty}\right) g$ are less than some small $\epsilon$. Then,
with $x \in [-A, +B]$, make use of the approximation (22) provided by the assumption
of a local limit law. Next, approximate the sum of $g(x)^2$ taken at regularly spaced
sampling points (a Riemann sum) by the corresponding integral. Finally, complete
back the tails. □
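Lemma 6 can be illustrated on a textbook case that is not treated in this paper and is chosen here only as an assumption-free test bed: binary words of length $n$ with $\chi$ the number of ones, so that $\chi$ is Binomial$(n, 1/2)$ with $\sigma_n = \sqrt{n}/2$. The lemma then predicts a coincidence probability asymptotic to $\frac{1}{2\sqrt{\pi}} \cdot \frac{2}{\sqrt{n}} = \frac{1}{\sqrt{\pi n}}$. The Python sketch below (function names are ours) compares the exact value with this prediction.

```python
from math import comb, pi, sqrt


def coincidence_probability(n):
    """Exact P(X = X') for independent X, X' distributed as Binomial(n, 1/2)."""
    total = sum(comb(n, k) ** 2 for k in range(n + 1))
    return total / 4 ** n


def lemma6_prediction(n):
    """K / sigma_n with K = 1/(2 sqrt(pi)) and sigma_n = sqrt(n)/2."""
    return 1.0 / sqrt(pi * n)


for n in (10, 100, 400):
    exact = coincidence_probability(n)
    print(n, exact, lemma6_prediction(n), exact / lemma6_prediction(n))
```

The ratio in the last column approaches 1 as $n$ grows, which is the behaviour the lemma describes.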

Given the local limit law expressed by Theorem 3, an immediate consequence of
Lemma 6 is the following.

Theorem 4. For Otter trees ($\mathcal{U}_n$) and phylogenetic trees ($\mathcal{B}_n$), the asymptotic
probabilities that two trees of size $n$ have the same number of symmetries admit the
forms

$$\mathcal{U}_n: \ \frac{1}{2\sigma\sqrt{\pi n}}, \qquad \mathcal{B}_n: \ \frac{1}{2\widehat{\sigma}\sqrt{\pi n}},$$

where $\sigma$, $\widehat{\sigma}$ are the two "variance constants" of Theorem 2.

In summary, as we see in several particular cases here, qualitatively similar phenomena
are expected in trees, whether plane or non-plane, labelled or unlabeled,
whereas, quantitatively, the structure constants (for instance, $\mu$ and $\widehat{\mu}$ in
Theorem 2; $\sigma$ and $\widehat{\sigma}$ in Theorem 4) tend to be model-specific. Yet another instance
of such universality phenomena is the height of Otter trees, analysed in [3], which
is to be compared to the height of plane binary trees [8]: both scale to $\sqrt{n}$ and lead
to the same elliptic-theta distribution, albeit with different scaling factors.